Intro

In this project I have been given data from the Lifelines project. Lifelines gave people living in the provinces of Drenthe, Groningen and Friesland a questionnaire that collected data on numerous categories, such as weight, height, finances, diseases, lifestyle choices and social status (finances, education, etc.). These categories have been measured in different ways, and some follow external rule sets. For example, the data on sleep quality (section: lifestyle and environment) has been reviewed by researchers from the Erasmus MC, who developed a PSQI derivative for the Lifelines cohort. The variable included in the public health dataset is an indicator of people who experience ‘poor sleep quality’. It takes a score of either 0, meaning bad sleep quality, or 1, meaning good sleep quality.

The task at hand is to create a clean data dashboard that displays the categories that matter most. In this research these are the lifestyle categories: METABOLIC_DISORDER, BURNOUT, DEPRESSION, MWK_VAL / SPORTS_T1, SLEEP_QUALITY, SMOKING, SUMOFALCOHOL, SUMOFKCAL, DBP (diastolic blood pressure in mm hg at baseline), HBF (pulse rate in beats per minute at baseline), FINANCE, BMI, WEIGHT, HEIGHT.

These lifestyle factors can be viewed in different ways. The data dashboard will have a static bar plot and an interactive bar plot, both of which can be filtered through a sidebar. The sidebar will have filters for which provinces, which genders and which age range are shown (a filter on financial situation will be added later). There will be 2 plot types, a hexbin plot and a bar plot; switching between them is done with a dropdown box.
The purpose of this data dashboard is to let the user see which lifestyles are common in the different provinces, and also which province is the healthiest or wealthiest. This can give them a general idea of which province would be beneficial for them. It can also prompt ideas for further research. For example, if the data shows that the province of Drenthe has an unusually high number of people with lung issues compared to the other provinces, this could raise questions about industry and air quality. This logbook contains an analysis of the data and the decisions about which data and which graphs to use, and how they will be shown on the data dashboard.
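The sidebar filters described in this intro could be backed by one small helper function. The following is only a minimal sketch, assuming the column names GENDER, AGE_T1 and Province used later in this logbook. The function name filter_lifelines and the toy data are made up for illustration and are not the app's actual code.

```r
library(dplyr)

# Hypothetical helper: keeps only the rows that match the sidebar selections.
filter_lifelines <- function(data, provinces, genders, age_range) {
  data %>%
    filter(
      Province %in% provinces,
      GENDER %in% genders,
      AGE_T1 >= age_range[1],
      AGE_T1 <= age_range[2]
    )
}

# Toy data, not real Lifelines records.
toy <- data.frame(
  Province = c("Drenthe", "Groningen", "Friesland", "Drenthe"),
  GENDER   = c(1, 2, 2, 2),
  AGE_T1   = c(34, 51, 47, 62)
)

filtered <- filter_lifelines(toy,
                             provinces = c("Drenthe", "Friesland"),
                             genders   = 2,
                             age_range = c(40, 70))
print(filtered)
```

Keeping the filtering in one function means the static and the interactive plot can both be drawn from the same filtered data frame.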

Data analysis, filtering and sorting

19-nov-2024

## here() starts at /Users/jarnoduiker/github_bioinf/Lifelines_datadashboard

20-nov-2024

(T = time stamp)

# Here I use describe() from the psych package to look at various statistics of my data.
describe(data_lifelines)
##                        vars     n    mean      sd  median trimmed     mad
## GENDER                    1 16696    1.59    0.49    2.00    1.61    0.00
## BIRTHYEAR                 2 16696 1963.88   11.16 1964.00 1963.77   10.38
## AGE_T1                    3 16696   46.62   11.18   47.00   46.67   10.38
## AGE_T2                    4 16696   50.39   11.13   51.00   50.48   10.38
## AGE_T3                    5 16696   56.40   11.16   57.00   56.51   10.38
## ZIP_CODE                  6 16696 9088.02  703.59 9281.00 9150.68  637.52
## BMI_T1                    7 16696   25.95    4.11   25.40   25.59    3.56
## WEIGHT_T1                 8 16696   79.58   14.72   78.00   78.69   14.83
## HIP_T1                    9 16696   99.18    9.18   98.00   98.63    7.41
## HEIGHT_T1                10 16696  174.92    9.35  174.00  174.72   10.38
## WAIST_T1                 11 16696   90.15   11.80   90.00   89.70   11.86
## BMI_T2                   12 16646   26.07    4.21   25.50   25.70    3.71
## WEIGHT_T2                13 16646   79.69   14.87   78.00   78.83   14.83
## HIP_T2                   14 16646   99.76    9.15   99.00   99.17    7.41
## HEIGHT_T2                15 16646  174.66    9.43  174.00  174.46   10.38
## WAIST_T2                 16 16646   90.10   12.15   89.00   89.64   11.86
## HEIGHT_T3                17 16696  174.22    9.51  173.50  174.03   10.38
## WEIGHT_T3                18 16696   81.85   15.66   80.30   80.89   15.27
## HIP_T3                   19 16693  102.20    9.59  101.00  101.48    7.41
## WAIST_T3                 20 16696   92.35   13.79   92.00   91.81   13.34
## EDUCATION_LOWER_T1       21 16666    0.25    0.44    0.00    0.19    0.00
## EDUCATION_LOWER_T2       22 15215    0.26    0.44    0.00    0.20    0.00
## FINANCE_T1               23 16696    6.96    2.62    7.00    7.23    2.97
## WORK_T1                  24 15077    0.74    0.44    1.00    0.80    0.00
## WORK_T2                  25 15218    0.77    0.42    1.00    0.84    0.00
## LOW_QUALITY_OF_LIFE_T1   26 16694    0.08    0.28    0.00    0.00    0.00
## LOW_QUALITY_OF_LIFE_T2   27 15230    0.10    0.30    0.00    0.00    0.00
## DBP_T1                   28 16690   74.12    9.28   73.00   73.70    8.90
## DBP_T2                   29 16631   74.13    9.39   73.00   73.75    8.90
## HBF_T1                   30 16690   70.62   10.74   70.00   70.27   10.38
## HBF_T2                   31 16631   68.54   11.11   68.00   68.09   10.38
## MAP_T1                   32 16688   93.45   10.01   92.00   92.80    8.90
## MAP_T2                   33 16631   94.53   10.36   93.00   93.85    8.90
## SBP_T1                   34 16690  125.71   14.92  124.00  124.92   14.83
## SBP_T2                   35 16631  128.18   15.91  127.00  127.31   16.31
## HTN_MED_T1               36 16630    0.11    0.31    0.00    0.01    0.00
## CHO_T1                   37 16615    5.14    0.99    5.10    5.10    1.04
## GLU_T1                   38 16557    5.00    0.78    4.90    4.92    0.44
## CHO_T2                   39 16234    5.12    0.97    5.10    5.09    1.04
## GLU_T2                   40 16121    5.05    0.83    4.90    4.95    0.44
## RESPIRATORY_DISEASE_T1   41 16655    0.08    0.27    0.00    0.00    0.00
## SMOKING                  42 16438    0.17    0.37    0.00    0.09    0.00
## METABOLIC_DISORDER_T1    43 16694    0.00    0.43    0.00    0.00    0.00
## METABOLIC_DISORDER_T2    44 16694    0.02    0.46    0.00    0.00    0.00
## LLDS                     45 15017   24.88    6.08   25.00   24.86    5.93
## SUMOFALCOHOL             46 11201    7.56    8.56    5.29    6.06    6.64
## SUMOFKCAL                47 11314 2010.83  628.25 1923.36 1961.05  533.65
## MWK_VAL                  48 15574  508.20  665.92  270.00  360.13  289.11
## SCOR_VAL                 49 15578 2654.12 3353.85 1500.00 1931.82 1630.86
## MWK_NO_VAL               50 15576  279.49  303.19  200.00  229.50  207.56
## SCOR_NO_VAL              51 15576 1502.57 1576.97 1080.00 1251.83 1156.43
## SPORTS_T1                52 15578    0.59    0.49    1.00    0.61    0.00
## CYCLE_COMMUTE_T1         53 14573    0.41    0.49    0.00    0.39    0.00
## VOLUNTEER_T1             54 14030    0.33    0.47    0.00    0.29    0.00
## PREGNANCIES              55  9167    1.92    1.19    2.00    1.90    1.48
## OSTEOARTHRITIS           56 16696    0.08    0.27    0.00    0.00    0.00
## BURNOUT_T1               57 16696    0.09    0.29    0.00    0.00    0.00
## DEPRESSION_T1            58 16696    0.10    0.29    0.00    0.00    0.00
## SLEEP_QUALITY            59  8118    0.35    0.48    0.00    0.32    0.00
## DIAG_CFS_CDC             60 14374    0.03    0.17    0.00    0.00    0.00
## DIAG_FIBROMYALGIA_ACR    61 14213    0.06    0.24    0.00    0.00    0.00
## DIAG_IBS_ROME3           62 14382    0.05    0.23    0.00    0.00    0.00
## C_SUM_T1                 63 15484   29.81    3.36   30.00   29.89    2.97
## A_SUM_T1                 64 15497   18.39    4.28   18.00   18.25    4.45
## SC_SUM_T1                65 15530   19.64    4.65   19.00   19.43    4.45
## I_SUM_T1                 66 15463   22.03    3.92   22.00   21.96    4.45
## E_SUM_T1                 67 15494   21.79    4.60   22.00   21.76    4.45
## SD_SUM_T1                68 15536   29.46    4.27   30.00   29.64    2.97
## V_SUM_T1                 69 15559   18.13    4.09   18.00   17.94    4.45
## D_SUM_T1                 70 15548   28.58    4.10   29.00   28.78    4.45
## LTE_SUM_T1               71 16344    1.00    1.22    1.00    0.79    1.48
## LDI_SUM_T1               72 16308    2.37    2.29    2.00    2.04    1.48
## LTE_SUM_T2               73 14947    0.79    1.09    0.00    0.60    0.00
## LDI_SUM_T2               74 15001    2.06    2.25    1.00    1.69    1.48
## NSES_YEAR                75 16696 2009.36    1.46 2010.00 2009.70    0.00
## NSES                     76 16200   -0.58    1.08   -0.58   -0.55    1.04
## NEIGHBOURHOOD1_T2        77 11747    8.22    1.46    8.00    8.38    1.48
## NEIGHBOURHOOD2_T2        78 11806    1.94    0.82    2.00    1.87    1.48
## NEIGHBOURHOOD3_T2        79 11812    1.45    0.68    1.00    1.34    0.00
## NEIGHBOURHOOD4_T2        80 11810    1.76    1.04    1.00    1.56    0.00
## NEIGHBOURHOOD5_T2        81 11809    3.69    1.02    4.00    3.78    1.48
## NEIGHBOURHOOD6_T2        82 11812    4.08    0.81    4.00    4.17    0.00
## MENTAL_DISORDER_T1       83 16320    0.08    0.32    0.00    0.00    0.00
## MENTAL_DISORDER_T2       84 13472    0.09    0.33    0.00    0.00    0.00
##                            min      max    range   skew kurtosis    se
## GENDER                    1.00     2.00     1.00  -0.35    -1.88  0.00
## BIRTHYEAR              1927.00  1995.00    68.00   0.08    -0.21  0.09
## AGE_T1                   18.00    84.00    66.00  -0.02    -0.25  0.09
## AGE_T2                   20.00    88.00    68.00  -0.05    -0.21  0.09
## AGE_T3                   25.00    95.00    70.00  -0.07    -0.18  0.09
## ZIP_CODE               1015.00  9998.00  8983.00  -1.41     7.47  5.45
## BMI_T1                   15.40    53.80    38.40   1.18     2.95  0.03
## WEIGHT_T1                42.00   158.00   116.00   0.68     0.86  0.11
## HIP_T1                   62.00   185.00   123.00   0.93     3.03  0.07
## HEIGHT_T1               137.00   207.00    70.00   0.18    -0.40  0.07
## WAIST_T1                 60.00   156.00    96.00   0.47     0.60  0.09
## BMI_T2                   13.00    54.50    41.50   1.18     2.91  0.03
## WEIGHT_T2                43.50   160.00   116.50   0.65     0.70  0.12
## HIP_T2                   68.00   192.50   124.50   0.99     3.19  0.07
## HEIGHT_T2               116.50   206.00    89.50   0.16    -0.27  0.07
## WAIST_T2                 52.00   155.00   103.00   0.44     0.41  0.09
## HEIGHT_T3               108.00   208.50   100.50   0.12     0.03  0.07
## WEIGHT_T3                36.70   168.30   131.60   0.72     1.07  0.12
## HIP_T3                    0.00   175.00   175.00   0.74     6.01  0.07
## WAIST_T3                 11.00   761.00   750.00   7.17   330.41  0.11
## EDUCATION_LOWER_T1        0.00     1.00     1.00   1.13    -0.73  0.00
## EDUCATION_LOWER_T2        0.00     1.00     1.00   1.12    -0.75  0.00
## FINANCE_T1                1.00    10.00     9.00  -0.75    -0.40  0.02
## WORK_T1                   0.00     1.00     1.00  -1.11    -0.77  0.00
## WORK_T2                   0.00     1.00     1.00  -1.27    -0.38  0.00
## LOW_QUALITY_OF_LIFE_T1    0.00     1.00     1.00   3.03     7.19  0.00
## LOW_QUALITY_OF_LIFE_T2    0.00     1.00     1.00   2.63     4.90  0.00
## DBP_T1                   47.00   143.00    96.00   0.54     0.70  0.07
## DBP_T2                   43.00   128.00    85.00   0.43     0.26  0.07
## HBF_T1                   31.00   147.00   116.00   0.41     0.73  0.08
## HBF_T2                   34.00   142.00   108.00   0.47     0.64  0.09
## MAP_T1                    0.00   160.00   160.00   0.68     2.16  0.08
## MAP_T2                   65.00   151.00    86.00   0.67     0.61  0.08
## SBP_T1                   72.00   221.00   149.00   0.64     1.03  0.12
## SBP_T2                   86.00   213.00   127.00   0.57     0.40  0.12
## HTN_MED_T1                0.00     1.00     1.00   2.49     4.19  0.00
## CHO_T1                    2.10     9.50     7.40   0.36     0.18  0.01
## GLU_T1                    2.70    22.10    19.40   6.05    85.10  0.01
## CHO_T2                    2.10     9.80     7.70   0.36     0.31  0.01
## GLU_T2                    2.80    20.60    17.80   4.87    48.42  0.01
## RESPIRATORY_DISEASE_T1    0.00     1.00     1.00   3.08     7.50  0.00
## SMOKING                   0.00     1.00     1.00   1.77     1.12  0.00
## METABOLIC_DISORDER_T1    -9.00     1.00    10.00 -17.24   349.06  0.00
## METABOLIC_DISORDER_T2    -9.00     1.00    10.00 -15.03   288.83  0.00
## LLDS                      4.00    46.00    42.00   0.02    -0.20  0.05
## SUMOFALCOHOL              0.00    76.49    76.49   1.89     5.11  0.08
## SUMOFKCAL                 4.18  7460.75  7456.57   1.28     4.90  5.91
## MWK_VAL                   0.00  6227.00  6227.00   2.48     7.25  5.34
## SCOR_VAL                  0.00 33960.00 33960.00   2.41     7.07 26.87
## MWK_NO_VAL                0.00  4450.00  4450.00   2.96    16.92  2.43
## SCOR_NO_VAL               0.00 26570.00 26570.00   2.70    16.57 12.64
## SPORTS_T1                 0.00     1.00     1.00  -0.37    -1.86  0.00
## CYCLE_COMMUTE_T1          0.00     1.00     1.00   0.36    -1.87  0.00
## VOLUNTEER_T1              0.00     1.00     1.00   0.71    -1.49  0.00
## PREGNANCIES               0.00     9.00     9.00   0.19     0.62  0.01
## OSTEOARTHRITIS            0.00     1.00     1.00   3.05     7.31  0.00
## BURNOUT_T1                0.00     1.00     1.00   2.84     6.06  0.00
## DEPRESSION_T1             0.00     1.00     1.00   2.76     5.62  0.00
## SLEEP_QUALITY             0.00     1.00     1.00   0.61    -1.62  0.01
## DIAG_CFS_CDC              0.00     1.00     1.00   5.53    28.61  0.00
## DIAG_FIBROMYALGIA_ACR     0.00     1.00     1.00   3.70    11.70  0.00
## DIAG_IBS_ROME3            0.00     1.00     1.00   3.92    13.40  0.00
## C_SUM_T1                 12.00    40.00    28.00  -0.31     0.84  0.03
## A_SUM_T1                  8.00    38.00    30.00   0.35     0.21  0.03
## SC_SUM_T1                 8.00    40.00    32.00   0.43     0.14  0.04
## I_SUM_T1                  8.00    38.00    30.00   0.17     0.06  0.03
## E_SUM_T1                  8.00    39.00    31.00   0.06    -0.22  0.04
## SD_SUM_T1                11.00    40.00    29.00  -0.51     0.70  0.03
## V_SUM_T1                  8.00    38.00    30.00   0.55     0.86  0.03
## D_SUM_T1                 12.00    40.00    28.00  -0.49     0.45  0.03
## LTE_SUM_T1                0.00    11.00    11.00   1.66     4.24  0.01
## LDI_SUM_T1                0.00    17.00    17.00   1.40     2.51  0.02
## LTE_SUM_T2                0.00    12.00    12.00   2.29    10.93  0.01
## LDI_SUM_T2                0.00    23.00    23.00   2.07     8.60  0.02
## NSES_YEAR              2006.00  2010.00     4.00  -1.86     1.47  0.01
## NSES                     -7.12     2.93    10.05  -0.30     0.76  0.01
## NEIGHBOURHOOD1_T2         1.00    10.00     9.00  -2.02     7.04  0.01
## NEIGHBOURHOOD2_T2         1.00     5.00     4.00   0.74     0.56  0.01
## NEIGHBOURHOOD3_T2         1.00     5.00     4.00   1.87     5.01  0.01
## NEIGHBOURHOOD4_T2         1.00     5.00     4.00   1.35     1.05  0.01
## NEIGHBOURHOOD5_T2         1.00     5.00     4.00  -0.67     0.12  0.01
## NEIGHBOURHOOD6_T2         1.00     5.00     4.00  -1.23     2.58  0.01
## MENTAL_DISORDER_T1        0.00     4.00     4.00   4.92    30.49  0.00
## MENTAL_DISORDER_T2        0.00     5.00     5.00   4.73    27.68  0.00

Here I use describe. Its output contains the following columns.

vars notes the variable index.

n is the number of values.

mean is the average.

sd is the standard deviation.

median is the middle value.

trimmed is the mean after trimming 10% of the observations from each tail.

mad is the median absolute deviation (from the median).

min and max are the minimum and maximum values.

range is the difference between the maximum and the minimum.

skew is the skewness of the distribution (values between -1 and +1 are considered excellent, values between -2 and +2 acceptable). Hair, J.F., Hult, G.T.M., Ringle, C.M., & Sarstedt, M. (2022). A Primer on Partial Least Squares Structural Equation Modeling (PLS-SEM) (3rd ed.). Thousand Oaks, CA: Sage.

kurtosis is a measure of the ‘tailedness’ of the distribution.

se is the standard error.
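To make these definitions concrete, most of the columns can be reproduced with base R. This is a minimal sketch on made-up numbers, not Lifelines data. It assumes that describe() uses the same defaults as the base functions (10% trimming, and a mad scaled by the constant 1.4826), which is consistent with the mad values in the output above.

```r
# Toy values, not Lifelines data.
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 40)

n       <- length(x)
avg     <- mean(x)               # 10.8
trimmed <- mean(x, trim = 0.1)   # drops the lowest and the highest value: 8.25
med     <- median(x)             # 8
mad_x   <- mad(x)                # median absolute deviation, scaled by 1.4826
rng     <- max(x) - min(x)       # 38
se      <- sd(x) / sqrt(n)       # standard error of the mean

print(c(n = n, mean = avg, trimmed = trimmed, median = med,
        mad = mad_x, range = rng, se = se))
```

The single large value (40) pulls the mean well above the median, while the trimmed mean and the mad are barely affected. This is why those two columns are useful for spotting outliers in the table.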

DATA ANALYSIS The headers need to be changed because they are unreadable without a codebook, and there are many NAs across the measurements. Apart from that, the CSV loads well and there are no problems with it.
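The number of NAs per variable can also be counted directly, instead of being read off the n column. A minimal sketch on a toy data frame (the values are invented; only the column names are borrowed from the dataset):

```r
# Toy data frame with missing values; not real Lifelines records.
toy <- data.frame(
  BMI_T1        = c(25.4, NA, 31.0, 22.8),
  SLEEP_QUALITY = c(1, NA, NA, 0),
  SMOKING       = c(0, 0, 1, 0)
)

# Number and share of missing values per column.
na_count <- colSums(is.na(toy))
na_share <- round(na_count / nrow(toy), 2)
print(data.frame(na_count, na_share))
```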

Postal codes

Here I have looked up all postal codes for each province. This is necessary to create a new category in the data frame that shows which province each participant is from. The postal codes are stored in vectors, and the data frame is then mutated (this is where the new column is made).

Friesland postal codes: 8388 - 9299 and 9850 - 9859

friesland_zipcodes <- c(
  8401:8409, 8411:8417, 8421:8429, 8431:8439, 8441:8449,
  8451:8459, 8461:8469, 8471:8479, 8481:8489, 8491:8499,
  8501:8509, 8511:8519, 8521:8529, 8531:8539, 8541:8549,
  8551:8559, 8561:8569, 8571:8579, 8581:8589, 8591:8599,
  8601:8609, 8611:8619, 8621:8629, 8631:8639, 8641:8649,
  8651:8659, 8661:8669, 8671:8679, 8681:8689, 8691:8699,
  8701:8709, 8711:8719, 8721:8729, 8731:8739, 8741:8749,
  8751:8759, 8761:8769, 8771:8779, 8781:8789, 8791:8799,
  8801:8809, 8811:8819, 8821:8829, 8831:8839, 8841:8849,
  8851:8859, 8861:8869, 8871:8879, 8881:8889, 8891:8899,
  9001:9009, 9011:9019, 9021:9029, 9031:9039, 9041:9049,
  9051:9059, 9061:9069, 9071:9079, 9081:9089, 9091:9099,
  9101:9109, 9111:9119, 9121:9129, 9131:9139, 9141:9149,
  9151:9159, 9161:9169, 9171:9179, 9181:9189, 9191:9199,
  9201:9209, 9211:9219, 9221:9229, 9231:9239, 9241:9249,
  9251:9259, 9261:9269, 9271:9279, 9281:9289, 9291:9299
)

groningen_zipcodes <- c(
  2750:2752, 2760:2761, 2811, 2840:2841, 2910:2914,
  5340:5359, 5366:5368, 5370:5371, 5373, 5386, 5394:5398,
  9350:9351, 9354:9356, 9359, 9361:9367, 9479, 
  9500:9503, 9540:9541, 9545, 9550:9551, 9560:9561, 9563, 9566, 
  9580:9581, 9584:9585, 9591, 9600:9611, 9613:9629, 
  9631:9633, 9635:9636, 9640:9649, 9651, 9661, 9663, 9665, 
  9670:9675, 9677:9679, 9681:9688, 9691, 9693, 9695:9704, 
  9711:9718, 9721:9728, 9731:9738, 9741:9747, 9750:9756, 
  9771, 9773:9774, 9790:9798, 9800:9805, 9811:9812, 
  9821:9825, 9827:9828, 9831:9833, 9841:9845, 
  9860:9866, 9881:9886, 9891:9893, 9900:9915, 
  9917:9925, 9930:9934, 9936:9937, 9939, 9942:9949, 
  9951, 9953:9957, 9961:9969, 9970:9999
)

drenthe_zipcodes <- c(
  3925, 7705, 7740:7742, 7750:7751, 7753:7756, 7760:7761, 7764:7766, 
  7800:7801, 7811:7815, 7821:7828, 7830:7831, 7833, 7840:7849, 7851:7856, 
  7858:7859, 7860:7864, 7871:7877, 7880:7881, 7884:7885, 7887, 7889:7892, 
  7894:7895, 7900:7918, 7920:7929, 7931:7938, 7940:7944, 7946, 7948:7949, 
  7957:7958, 7960:7966, 7970:7975, 7980:7986, 7990:7991, 8066, 8325:8326, 
  8334:8339, 8341:8347, 8351, 8355:8356, 8361:8363, 8371:8378, 8380:8398, 
  8420:8428, 8430:8435, 8437:8439, 8470:8479, 8481:8489, 9300:9307, 
  9311:9315, 9320:9321, 9330:9337, 9341:9343, 9351, 9400:9423, 9430:9439, 
  9441:9449, 9450:9469, 9470:9475, 9480:9489, 9491:9497, 9511:9512, 9514:9515, 
  9520:9528, 9530:9537, 9564, 9571, 9573:9574, 9654:9659, 9749, 9760:9761, 
  9765:9766, 9780:9785, 9959
)
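Because the ranges are typed by hand, it may be worth checking whether a postal code ends up in more than one vector. The following is a minimal sketch using base R's intersect(). The short vectors are toy stand-ins that reuse a few (shortened) ranges from the vectors above, and their names are made up for this example.

```r
# Toy stand-ins for the three postal code vectors.
friesland_toy <- c(8421:8429, 9001:9009)
groningen_toy <- c(9350:9351, 9711:9718)
drenthe_toy   <- c(8420:8428, 9351, 9400:9405)

# Postal codes that appear in two vectors at once.
overlap_fr_dr <- intersect(friesland_toy, drenthe_toy)
overlap_gr_dr <- intersect(groningen_toy, drenthe_toy)

print(overlap_fr_dr)  # 8421 to 8428
print(overlap_gr_dr)  # 9351
```

An empty result for every pair would mean each postal code maps to exactly one province. A non-empty result means the order in which the vectors are checked decides the province.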

# Here the data frame is mutated by adding a new column with the province each participant is from.
# case_when() looks at the number in the ZIP_CODE column: if it is in friesland_zipcodes it puts
# "Friesland" in the Province column; if it is not, it moves on to the next option.
# Postal codes that match none of the vectors get NA.

data_lifelines <- data_lifelines %>%
    mutate(Province = case_when(ZIP_CODE %in% friesland_zipcodes ~ "Friesland", 
                             ZIP_CODE %in% groningen_zipcodes ~ "Groningen",
                             ZIP_CODE %in% drenthe_zipcodes ~ "Drenthe"))
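# A small illustration of how case_when() behaves, as a sketch on toy data (the *_toy
# names are made up). The first condition that matches wins, and rows that match
# nothing become NA.

```r
library(dplyr)

# Toy lookup vectors; 8425 is deliberately placed in both of them.
friesland_toy <- c(8425, 9001)
drenthe_toy   <- c(8425, 9400)

toy <- data.frame(ZIP_CODE = c(9001, 8425, 9400, 1015))

toy <- toy %>%
  mutate(Province = case_when(
    ZIP_CODE %in% friesland_toy ~ "Friesland",  # checked first
    ZIP_CODE %in% drenthe_toy   ~ "Drenthe"     # only reached if not Friesland
  ))

print(toy)
```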


# Here only the head of the Province column is printed to check that it worked
head(data_lifelines$Province)
## [1] "Groningen" "Friesland" "Friesland" "Drenthe"   "Friesland" NA
# Here I take gender from the data frame and, with %>%, turn the options 1 and 2 into Male and Female
GENDER1 <- data_lifelines$GENDER %>% factor(levels = c(1,2), labels = c("Male", "Female"))

# Here I use ggplot to look at the weekly alcohol consumption per gender in grams
ggplot(data_lifelines, aes(x=GENDER1, y=SUMOFALCOHOL, fill=GENDER1)) +
    geom_boxplot() +
    ylab("Sum of alcohol per week in grams") +
    xlab("Gender") +
    labs(title="How many grams of alcohol does each gender consume per week")
## Warning: Removed 5495 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

In this boxplot you can see that men on average consume more grams of alcohol per week than women. However, the difference is not large enough to be worth showing in the data dashboard.

ggplot(data_lifelines, aes(x=GENDER1, y=SUMOFKCAL, fill=GENDER1)) +
    geom_boxplot() +
    xlab("Gender") +
    ylab("Sum of kcal - per day") +
    labs(title = "Sum of kcal - per day, per gender")
## Warning: Removed 5382 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Here is a boxplot of the sum of kcal per day for each gender. As expected, the average for men is around 2300-2400 and the average for women is around 1800-1900. Notably, there are no men with an extremely low kcal intake, while there are women who eat 0 or close to 0 kcal each day. This could, however, be a randomisation error. This is because to keep the data private the Lifelines project

26-nov-2024

ggplot(data_lifelines, aes(x=WEIGHT_T1, y=HEIGHT_T1)) +
    geom_point(alpha= 0.2, size=.2) +
    xlab("Weight on measuring moment 1 (in KG)") +
    ylab("Height in cm") +
    labs(title = "Height and weight scatterplot")

This scatterplot shows height against weight. Because of the large amount of data, the points had to be made quite small and the alpha was set very low. The plot does show a general area where most dots fall: between 60 and 100 kg and between 150 and 190 cm. This plot is not very interesting and will therefore not be shown in the data dashboard.

ggplot(data_lifelines, aes(x= PREGNANCIES)) +
    geom_bar(fill="#69b3a2", alpha=0.8) +
    xlab("Pregnancies") +
    labs(title = "How many pregnancies do women have")
## Warning: Removed 7529 rows containing non-finite outside the scale range
## (`stat_count()`).

In the Lifelines questionnaire women were asked how many pregnancies they have had; this is displayed here. The most common number is 2 pregnancies (or two children), followed by 3 and then 0. This is as expected and not necessarily notable for the dashboard.

ggplot(data_lifelines, aes(x=AGE_T1, y=WEIGHT_T1)) +
    geom_jitter(alpha= 0.2, size=.2)+
    ylab("Weight on measuring moment 1 (in KG)") +
    xlab("age") +
    labs(title = "Age and weight scatterplot")

This scatterplot compares weight and age; there is no notable spike or difference across the ages. The only thing I see is a lot of data points around the age of 50, but this is because most participants are this age.

27-nov-2024

Making a subset for smokers vs non-smokers

smokers_lifeline <- subset(data_lifelines, data_lifelines$SMOKING > 0)
non_smokers_lifeline <- subset(data_lifelines, data_lifelines$SMOKING == 0)

gender_smokers <- smokers_lifeline$GENDER %>% factor(levels = c(1,2), labels = c("Male", "Female"))


head(gender_smokers)
## [1] Female Female Male   Male   Female Male  
## Levels: Male Female
# The data only contains smokers; the fill shows whether they have a respiratory disease.
ggplot(smokers_lifeline, mapping = aes(x=gender_smokers, fill = factor(RESPIRATORY_DISEASE_T1))) + 
  geom_bar() +
    xlab("Gender") +
    ylab("Count of people") +
    labs(title = "Respiratory disease among smokers for each gender", fill = "Respiratory disease
         0 = No
         1 = Yes")

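This bar plot shows the smokers per gender, with the fill colour indicating whether they have a respiratory disease. Interestingly, the number of smokers with a respiratory disease appears to be a lot lower than expected: fewer than 250! It could be interesting to compare the number of smokers with the number of lung problems.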

data_lifelines_gender <- data_lifelines$GENDER %>% factor(levels = c(1,2), labels = c("Male", "Female"))

ggplot(data_lifelines, mapping = aes(x=AGE_T1, fill = factor(data_lifelines_gender))) + 
    geom_bar()+
    xlab("Age") +
    ylab("Count of people") +
    labs(title = "Age distribution per gender", fill = "Gender") +
     coord_flip() 

This plot shows the distribution of males and females by age. The graph is clear and could be a good fit for the data dashboard: combined with the province filter, it can show the difference between males and females per province well.

data_lifelines_gender <- data_lifelines$GENDER %>% factor(levels = c(1,2), labels = c("Male", "Female"))


ggplot(data_lifelines, mapping = aes(x=Province,y=BMI_T1, color = factor(data_lifelines_gender))) + 
    geom_quasirandom()+
    xlab("Province") +
    ylab("BMI on measuring moment 1") +
    labs(title = "BMI per province and gender", color = "Gender")

ggplot(data_lifelines, mapping = aes(x=AGE_T1, fill = factor(DEPRESSION_T1))) + 
  geom_bar()+
    xlab("Age") +
    ylab("Count of people") +
    labs(title = "This graph shows if people have depression shown by age", fill = "Depression
         0 = Not depressed
         1 = depressed")

This graph shows depression per age. It is hard to say anything about depression from it, because there are far more answers in the 30-50 range, which makes comparing ages difficult. Depression definitely has a place on the data dashboard, but not in this graph.
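One way around the unequal group sizes is to compare shares instead of counts. The following is a minimal sketch on toy data (the ages and scores are invented, and the age bins are an arbitrary choice for this example):

```r
# Toy data, not real Lifelines records.
toy <- data.frame(
  AGE_T1        = c(25, 28, 35, 38, 41, 44, 47, 49, 55, 63),
  DEPRESSION_T1 = c(0,  1,  0,  0,  1,  0,  0,  0,  1,  0)
)

# Bin the ages; the middle group holds most of the rows, as in the real data.
toy$age_group <- cut(toy$AGE_T1, breaks = c(17, 30, 50, 95),
                     labels = c("18-30", "31-50", "51-95"))

# Share of participants with depression in each age group.
share <- tapply(toy$DEPRESSION_T1, toy$age_group, mean)
print(round(share, 2))
```

In ggplot2 the same idea is available through geom_bar(position = "fill"), which rescales every bar to the same height so that only the proportions differ.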

ggplot(data_lifelines, mapping = aes(x=AGE_T1, fill = factor(BURNOUT_T1))) + 
  geom_bar()+
    xlab("Age") +
    ylab("Count of people") +
    labs(title = "This graph shows if people have a burnout shown by age", fill = "Burnout
         0 = Not Burned out
         1 = In a burnout")

Like the previous graph, this one shows burnout by age. Burnouts seem to start at around 30 and are quite frequent between 30 and 50, which suggests that the stress people experience before the age of 30 is usually not enough to cause a burnout. This could definitely be a plot to show on the data dashboard.

ggplot(data_lifelines, mapping = aes(x=AGE_T1, fill = factor(OSTEOARTHRITIS))) + 
  geom_bar() +
    xlab("Age") +
    ylab("Count of people") +
    labs(title = "This graph shows if people have osteoarthritis shown by age", fill = "Osteoarthritis
         0 = doesn't have osteoarthritis
         1 = does have osteoarthritis")

Osteoarthritis is a degenerative joint disease in which the tissues in the joint break down over time. It is the most common type of arthritis and is more common in older people. People with osteoarthritis usually have joint pain and, after rest or inactivity, stiffness for a short period of time. In this plot the peak around age 50 can be ignored, because there are almost three times as many answers for it as for the next ages. However, the plot does show that this joint disease is indeed more prominent in older people. This confirms what is already known, so it is not a very interesting plot, but we could show a plot per province to see what the differences between them are.

ggplot(data_lifelines, mapping = aes(x=AGE_T1, fill = factor(FINANCE_T1))) + 
  geom_bar() +
    xlab("Age") +
    ylab("Count of people") +
    labs(title = "This graph shows the financial situation per age", fill = "Financial situation
         1 = worst
         10 = best (3500$+ a month)")

This plot shows the financial situation per age. It is also difficult to read, because of the big difference in counts between age 50 and the ages above 50. Finance will be good to show in relation to other lifestyle factors, but this plot in particular is not interesting enough.

ggplot(data_lifelines, aes(SUMOFALCOHOL, FINANCE_T1)) +
    geom_jitter(width = .5, size=1) +
    xlab("Gram of alc per week") +
    ylab("Financial situation") +
    labs(title = "This plot shows the financial situation and alcohol in gram per week")
## Warning: Removed 5495 rows containing missing values or values outside the scale range
## (`geom_point()`).

This plot shows how many grams of alcohol are drunk per week against the subject's financial situation. It is not a really noteworthy plot in my opinion, so it will not be used in the final data dashboard.

ggplot(data_lifelines, aes(SUMOFALCOHOL, PREGNANCIES)) +
    geom_jitter(width = .5, size=1) +
    xlab("Gram of alc per week") +
    ylab("Number of pregnancies") +
    labs(title = "This plot shows the number of pregnancies and alcohol in gram per week")
## Warning: Removed 10390 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggplot(data_lifelines, mapping = aes(x=GENDER1, fill = factor(DEPRESSION_T1))) + 
  geom_bar()

This plot shows depression per gender. It is not that interesting and will not be added to the final app.

bin<-hexbin(data_lifelines$WEIGHT_T1, data_lifelines$HEIGHT_T1, xbins=40)
my_colors=colorRampPalette(rev(brewer.pal(11,'Spectral')))
plot(bin, 
     main = "This Hexbin shows height and weight",
     xlab = "Weight in KG",
     ylab = "Height in cm",
     colramp = my_colors,
     legend = FALSE)

A hexbin plot is useful to represent the relationship between 2 numerical variables when you have a lot of data points. Instead of drawing overlapping points, the plotting window is split into hexagonal bins, and each bin is coloured by the number of points it contains.
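What the binning does can be checked on simulated data. This is a minimal sketch, assuming the hexbin package is installed; the weights and heights are random numbers, not Lifelines measurements.

```r
library(hexbin)

set.seed(1)  # simulated values, not Lifelines data
weight <- rnorm(5000, mean = 80, sd = 15)
height <- 175 + 0.3 * (weight - 80) + rnorm(5000, sd = 8)

# Split the plotting area into hexagons and count the points in each one.
bin <- hexbin(weight, height, xbins = 30)

# Every point falls in exactly one hexagon, so the counts add up to 5000.
total_points <- sum(bin@count)
print(total_points)
```

A larger xbins gives smaller hexagons and more detail, but also more hexagons that contain only a handful of points.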

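# Alcohol per week is stored in SUMOFALCOHOL; keep only the rows where both plotted variables are present.
alc_complete <- data_lifelines[complete.cases(data_lifelines[, c("AGE_T1", "SUMOFALCOHOL")]), ]
bin_alc<-hexbin(alc_complete$AGE_T1, alc_complete$SUMOFALCOHOL, xbins=20)
my_colors=colorRampPalette(rev(brewer.pal(11,'Spectral')))
plot(bin_alc, 
     main = "This Hexbin shows age and alcohol consumption",
     xlab = "Age",
     ylab = "Alcohol in gram per week",
     colramp = my_colors,
     legend = FALSE)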

This hexbin is not a really interesting plot in my opinion, partly because hexbins are hard to understand, and it does not show anything significant. It will therefore not be in the final app.

bin_blp_t1<-hexbin(data_lifelines$WEIGHT_T1, data_lifelines$DBP_T1, xbins=40)
my_colors=colorRampPalette(rev(brewer.pal(11,'Spectral')))
plot(bin_blp_t1, 
     xlab = "Weight (kg)",
     ylab = "DBP (mm hg)",
     colramp = my_colors)

bin_blp_t2<-hexbin(data_lifelines$WEIGHT_T2, data_lifelines$DBP_T2, xbins=40)
my_colors=colorRampPalette(rev(brewer.pal(11,'Spectral')))
plot(bin_blp_t2, 
     xlab = "Weight (kg)",
     ylab = "DBP (mm hg)",
     colramp = my_colors)

Here two hexbin plots have been made; they show DBP and weight at 2 different measuring moments, to see if there is a difference between them. At the second measuring moment the weight seems to have increased, but the outliers in the maximum DBP have gone down. At the first measuring moment there are two counts of a DBP of 140, which is incredibly high and dangerous.

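# Only drop rows that miss one of the two plotted variables; na.omit() on the whole
# data frame would also drop every row with an NA in any other column.
data_sleepqual_dbp <- data_lifelines %>%
    filter(!is.na(SLEEP_QUALITY), !is.na(DBP_T1))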
ggplot(data_sleepqual_dbp, aes(x=factor(SLEEP_QUALITY), y=DBP_T1)) + 
    geom_violin() +
    xlab("Sleep quality") +
    ylab("DBP (mm hg)") +
    theme_minimal()

This is a violin plot. A violin plot depicts distributions of numeric data for one or more groups using density curves. The width of each curve corresponds with the approximate frequency of data points in each region. This plot shows DBP for people who have a good sleep quality and people who have a bad sleep quality. The plot shows that people with a better sleep quality (score 1) generally have a lower DBP. This plot will be put in the app, where it should be interesting in combination with the interactive filters.
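When preparing data for a plot like this one, it matters how missing values are removed. na.omit() on a whole data frame drops every row that has an NA in any column, while complete.cases() on only the plotted columns keeps more rows. A minimal sketch with toy data (the values are invented; only the column names come from the dataset):

```r
# Toy data: PREGNANCIES is missing for men by design.
toy <- data.frame(
  GENDER        = c(1, 1, 2, 2),
  PREGNANCIES   = c(NA, NA, 2, 0),
  SLEEP_QUALITY = c(1, 0, 1, NA),
  DBP_T1        = c(74, 81, 70, 68)
)

# na.omit() drops every row that has an NA in any column.
all_complete <- na.omit(toy)

# Restricting the check to the plotted columns keeps more rows.
plot_complete <- toy[complete.cases(toy[, c("SLEEP_QUALITY", "DBP_T1")]), ]

print(nrow(all_complete))   # 1
print(nrow(plot_complete))  # 3
```

In the toy example all men disappear with na.omit(), only because a column that is not even plotted is empty for them.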

ggplot(data_lifelines, aes(x=factor(FINANCE_T1), y=DBP_T1)) + 
    geom_boxplot() +
    xlab("Finance") +
    ylab("DBP (mm hg)") +
    theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Here is a boxplot of DBP for each financial situation. The line in each box shows the median. The median is lowest for people who have about 750 to spend, and highest in financial situation 7. This suggests that more money goes together with a higher DBP, but of course this cannot be said for certain because there are more factors to look at.

ggplot(data_lifelines, aes(x=!is.na(SPORTS_T1), y=DBP_T1)) + 
    geom_boxplot() +
    xlab("Participates in sports") +
    ylab("DBP (mm hg)") +
    theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Here is a boxplot that shows the difference in DBP between people who do sports and people who don't. Here everyone with a recorded value for SPORTS_T1 counts as doing sports. There isn't a big difference, but with the filtering there could be an interesting result.

09-01-2025

Added some interactive plotly versions of the existing static plots with the following code. Plotly makes a ggplot interactive; this is done by rendering the plot with ggplotly(plot).

p_age_count <- ggplot(data_lifelines, aes(x = AGE_T1, fill = factor(GENDER))) +
    geom_bar() +
    ylab("Count of People") +
    xlab("Age") +
    labs(fill = "Gender (1 = Male, 2 = Female)") +
    theme_minimal() +
    coord_flip()

ggplotly(p_age_count)

Here is an example of what a plot looks like when rendered with ggplotly. By hovering over a bar, the user can now see the exact count and which gender that bar represents. This will be implemented in the app, so that the user has both a static and an interactive plot to work with.

netherlands <- rnaturalearth::ne_states(country = "Netherlands", returnclass = "sf")

selected_provinces <- netherlands %>%
  filter(name %in% c("Groningen", "Friesland", "Drenthe"))

tm_shape(selected_provinces) +
    tm_polygons(col = "name", title = "Province", border.col = "black") +
    tm_layout(title = "Selected Dutch Provinces") +
    tm_borders()
## 
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: use 'fill' for the fill color of polygons/symbols
## (instead of 'col'), and 'col' for the outlines (instead of 'border.col').
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(title = )`
## This message is displayed once every 8 hours.

I thought a fun way to show where the data came from would be to show the provinces with this plot. It shows Friesland, Groningen and Drenthe. This map will be included in the FAQ.

18-01-2025

ggplot(data_lifelines, aes(x=!is.na(SPORTS_T1), y=CHO_T1)) + 
    geom_boxplot() +
    xlab("Participates in sports") +
    ylab("Cholesterol (mmol/L)") +
    theme_minimal()
## Warning: Removed 81 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Here is a boxplot that shows the difference in cholesterol levels between people who do and don't do sports. It is generally known that playing sports can improve your general health. This might be interesting to show in the app.

ggplot(data_lifelines, aes(x=!is.na(SPORTS_T1), y=DBP_T1)) + 
    geom_boxplot() +
    xlab("Participates in sports") +
    ylab("DBP (mm hg)") +
    theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Here is a boxplot that shows the difference in DBP levels between people who do and don't do sports. It is known that playing sports can lower your blood pressure in general. This might be interesting to show in the app. The filtering could potentially show some interesting results.

19-01-2025

ggplot(data_lifelines, aes(x = DEPRESSION_T1, y = factor(FINANCE_T1), fill = FINANCE_T1)) +
  geom_density_ridges() +
  theme_ridges() + 
  theme(legend.position = "none")
## Picking joint bandwidth of 0.0677

This ridge plot shows the density distributions of depression scores (DEPRESSION_T1) across different financial levels (FINANCE_T1, grouped as 1–10). Each curve represents a financial group, with the shape and position indicating how depression scores are distributed within that group. A systematic shift or variation in the peaks suggests a potential relationship between financial status and depression, where higher financial levels might be associated with lower or more stable depression scores. The gradient color aids in distinguishing the groups visually. It seems that financial groups 4 and 5 have more people with depression than the higher financial groups. This could be due to multiple reasons, from stress about not being able to afford things to high work stress in order to make enough money to get by.

ggplot(data=data_lifelines, aes(x=NSES, group=factor(FINANCE_T1), fill=FINANCE_T1)) +
    geom_density(adjust=1.5) +
    facet_wrap(~FINANCE_T1) 
## Warning: Removed 496 rows containing non-finite outside the scale range
## (`stat_density()`).

NSES = Neighborhood socio-economic status score according to CBS Statistics Netherlands, based on inhabitants’ educational level, income and job prospects. This faceted density plot shows the density distributions of NSES scores across different financial levels (FINANCE_T1, grouped as 1–10). Each panel represents a financial group, with the shape and position of the curve indicating how NSES scores are distributed within that group. A higher score means a better neighborhood, which could be good for people’s mental and physical health. This plot will be interesting to show in the app.

ggplot(data_lifelines, aes(x = BURNOUT_T1, y = factor(FINANCE_T1), fill = FINANCE_T1)) +
  geom_density_ridges() +
  theme_ridges() + 
  theme(legend.position = "none")
## Picking joint bandwidth of 0.0606

This ridge plot shows the density distributions of burnout scores across different financial levels (FINANCE_T1, grouped as 1–10). Each curve represents a financial group, with the shape and position indicating how burnout scores are distributed within that group. Compared to the depression plot, there are no pronounced peaks, not even minor ones. For this reason this plot is not interesting enough to show.

ggplot(data=data_lifelines, aes(x=SUMOFALCOHOL, group=factor(DEPRESSION_T1), fill=DEPRESSION_T1)) +
    geom_density(adjust=1.5) +
    facet_wrap(~DEPRESSION_T1) +
    theme(legend.position = "none")
## Warning: Removed 5495 rows containing non-finite outside the scale range
## (`stat_density()`).

This faceted density plot shows the distribution of how much alcohol someone drinks per day, in grams, for people who do and don’t have depression. With the basic filter there doesn’t seem to be a significant difference in alcohol consumption between the two groups. Therefore putting it in the app might not be very important, but it could serve as a backup.
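The comparison above was made by eye. One possible way to back it up is a Wilcoxon rank-sum test from base R, which compares two groups without assuming a normal distribution. This test is not part of the original analysis; the sketch below only shows how it could be done, on a small made-up data frame (values invented for illustration), with DEPRESSION_T1 assumed to be coded 0/1.

```r
# Toy stand-in for data_lifelines; all values are invented for illustration.
toy <- data.frame(
  SUMOFALCOHOL  = c(1, 3, 5, 7, NA, 2, 4, 6, 8),
  DEPRESSION_T1 = c(0, 0, 0, 0, 0, 1, 1, 1, 1)
)

# Split the alcohol values by depression status and drop missing values.
has_value   <- !is.na(toy$SUMOFALCOHOL)
alcohol_no  <- toy$SUMOFALCOHOL[has_value & toy$DEPRESSION_T1 == 0]
alcohol_yes <- toy$SUMOFALCOHOL[has_value & toy$DEPRESSION_T1 == 1]

# Two-sided Wilcoxon rank-sum test on the two groups.
result <- wilcox.test(alcohol_no, alcohol_yes)
print(result)
```

A large p-value, as in this toy example, would be in line with the impression that there is no clear difference between the groups.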

Plan for datadashboard

The plan for the datadashboard is to let the user filter the data. The data can be filtered in 3 ways: by gender, by which provinces are shown and by which age range is shown. Then the user can select a plot. These plots will be preselected by me and will show a couple of lifestyle factors I have selected.

  • METABOLIC_DISORDER T1/T2
  • BURNOUT
  • DEPRESSION
  • MWK_VAL / SPORTS_T1
  • SLEEP_QUALITY
  • SMOKING
  • SUMOFALCOHOL
  • SUMOFKCAL
  • DBP T1/T2 (Diastolic Blood Pressure in mm hg at baseline)
  • HBF T1/T2 (Pulse rate in beats per minute at baseline)
  • FINANCE
  • BMI T1/T3
  • WEIGHT T1/T3
  • HEIGHT T1/T3

I found these categories interesting at first. Then I looked at the different graphs in the analysis above, and they showed me that it is very important to have a good way to visualize them. That being said, some things also appeared less interesting, like BMI, weight and height, because almost every dataset about provinces already shows these. Smoking, depression, burnout and sum of alcohol could be interesting though. These could show, for example, which province is best for the least stress in the context of a burnout.
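The three filters from the plan can be sketched as one function that the dashboard calls before drawing a plot. This is a minimal sketch in base R, not the code of the app. GENDER and AGE_T1 are taken from the plots above; the PROVINCE column name and the function name filter_lifelines are assumptions made for illustration, and the data frame is made up.

```r
# Hypothetical helper: keep only the rows that match the sidebar choices.
# genders and provinces are vectors of allowed values, age_range is c(min, max).
filter_lifelines <- function(data, genders, provinces, age_range) {
  keep <- data$GENDER %in% genders &
    data$PROVINCE %in% provinces &      # PROVINCE is an assumed column name
    !is.na(data$AGE_T1) &               # rows without an age are dropped
    data$AGE_T1 >= age_range[1] &
    data$AGE_T1 <= age_range[2]
  data[keep, , drop = FALSE]
}

# Toy stand-in for data_lifelines; all values are invented for illustration.
toy <- data.frame(
  GENDER   = c(1, 2, 2, 1, 2),
  PROVINCE = c("Drenthe", "Groningen", "Friesland", "Drenthe", "Groningen"),
  AGE_T1   = c(34, 51, 67, 45, NA),
  DBP_T1   = c(72, 80, 91, 77, 69)
)

# Example: women from Groningen and Friesland between 40 and 70 years old.
filtered <- filter_lifelines(
  toy,
  genders   = 2,
  provinces = c("Groningen", "Friesland"),
  age_range = c(40, 70)
)
print(filtered)
```

Keeping the filtering in one function means every plot in the dashboard can use the same filtered data, so the static and interactive versions always show the same selection.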

What actually got put in the data dashboard.

The focus started to shift to something more specialised: an interest arose in DBP and in the factors that could contribute to a heightened DBP. The reason for this focus is that a high DBP can go unnoticed by people who don’t pay attention to the symptoms or are unaware of them. A high DBP can contribute to a clogged artery, which could be blocked completely in case of a blood clot. This could cause a variety of issues, some of them deadly, like a heart attack. A blood vessel could also tear, causing other issues, including an aneurysm. Therefore making people aware of this felt like a good goal. All of the graphs in the app will have a direct or indirect relation to DBP.